Weight Initialization in DL

01 — Core Idea

Why Starting Values Matter

Before a neural network learns anything, every weight must be given an initial value. This isn't just a technicality — it's the difference between a network that converges in minutes and one that never learns at all.

Think of it like tuning a guitar.

If every string starts wildly out of tune, the musician (optimizer) has to make huge corrections that may overshoot. If every string starts at the exact same note, you can't play a chord — you need variety. The ideal: each string starts close to its correct pitch, with just enough variation.

The golden rule: keep activation variance ≈ constant across every layer.

If variance shrinks layer by layer, gradients vanish (network stops learning). If variance grows, gradients explode (network becomes unstable). Good initialization keeps the signal stable as it flows forward and backward.

🪞

Symmetry

All weights identical → all neurons compute the same thing → network collapses to 1 neuron per layer.

📉

Vanishing Gradients

Weights too small → activations shrink exponentially through layers → gradients shrink with them, and most layers barely update.

💥

Exploding Gradients

Weights too large → activations blow up → loss becomes NaN, training crashes.
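The symmetry failure is easy to reproduce by hand. Below is a minimal sketch in plain Python; the two-unit tanh network, the squared-error loss, and the helper name `hidden_weight_grads` are illustrative assumptions, not taken from any library.

```python
import math

def hidden_weight_grads(W1, w2, x, target):
    """Gradients of 0.5 * (y - target)^2 w.r.t. each hidden unit's input weights.

    Network: h_j = tanh(sum_i W1[j][i] * x[i]),  y = sum_j w2[j] * h_j
    """
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    y = sum(w * hj for w, hj in zip(w2, h))
    dy = y - target
    grads = []
    for j, hj in enumerate(h):
        dz = dy * w2[j] * (1.0 - hj * hj)  # backprop through tanh
        grads.append([dz * xi for xi in x])
    return grads

x, target = [1.0, -2.0], 0.3

# Every weight identical: both hidden units receive the same gradient,
# so one update step leaves them identical again.
same = hidden_weight_grads([[0.5, 0.5], [0.5, 0.5]], [0.5, 0.5], x, target)
print(same[0] == same[1])  # True

# Slightly different starting weights: the gradients differ, symmetry is broken.
varied = hidden_weight_grads([[0.5, 0.4], [0.3, 0.6]], [0.5, 0.5], x, target)
print(varied[0] == varied[1])  # False
```

Because identical units get identical updates, no amount of training separates them. Random initialization is what breaks the tie.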

02 — See It Happen

How Signals Flow Through Layers

Watch what happens to the activation distribution as a signal passes through a 10-layer network with different init schemes. Each bar shows the spread (variance) of activations at that layer.

[Interactive chart: Activation Variance Across Layers. Controls: init scheme, activation function.]
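The chart can be approximated offline. Here is a rough sketch, assuming a 10-layer fully connected ReLU network of width 64, a standard deviation of 0.01 for the "small random" scheme, and mean squared activation as the measure of spread. `layer_second_moments` is a made-up helper, written in plain Python so it runs without dependencies.

```python
import math
import random

def relu(v):
    return v if v > 0.0 else 0.0

def layer_second_moments(weight_std, act=relu, width=64, depth=10, batch=16, seed=0):
    """Mean squared activation after each layer of a random fully connected net."""
    rng = random.Random(seed)
    xs = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(batch)]
    moments = []
    for _ in range(depth):
        W = [[rng.gauss(0.0, weight_std) for _ in range(width)] for _ in range(width)]
        xs = [[act(sum(w * xi for w, xi in zip(row, x))) for row in W] for x in xs]
        moments.append(sum(v * v for x in xs for v in x) / (batch * width))
    return moments

width = 64
schemes = {
    "small": 0.01,                                # fixed std, ignores layer width
    "xavier": math.sqrt(2.0 / (width + width)),   # Var(W) = 2 / (n_in + n_out)
    "he": math.sqrt(2.0 / width),                 # Var(W) = 2 / n_in
}
results = {name: layer_second_moments(std, width=width) for name, std in schemes.items()}
for name, moments in results.items():
    print(f"{name:>6}: layer 1 = {moments[0]:.3g}, layer 10 = {moments[-1]:.3g}")
```

With ReLU, the small-random and Xavier runs should shrink layer by layer, while the He run stays within a modest factor of its starting scale. Exact numbers depend on the seed and width.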

03 — Evolution

A Brief History

Pre-2010 — The Dark Ages

Small Random

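Weights drawn from N(0, 0.01): a zero-mean Gaussian with a fixed small spread, regardless of layer width. Worked for 2–3 layer nets. Deeper? Gradients vanished, because activations shrank toward zero within a handful of layers.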

2010 — Glorot & Bengio

Xavier / Glorot Init

Insight: balance the variance for both forward and backward passes. For a layer with n_in inputs and n_out outputs, average them:

Var(W) = 2 / (n_in + n_out)

Derived for symmetric activations that are approximately linear near zero, such as tanh (and commonly applied to sigmoid as well). Breaks down for ReLU, which zeros out half the signal.
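In practice the formula is used as a standard deviation (normal variant) or as a bound (uniform variant). A small sketch follows; the function names are illustrative, and the layer sizes 512 and 256 are arbitrary.

```python
import math

def xavier_std(n_in, n_out):
    """Std for Xavier normal init: Var(W) = 2 / (n_in + n_out)."""
    return math.sqrt(2.0 / (n_in + n_out))

def xavier_uniform_bound(n_in, n_out):
    """Half-width a of U(-a, a) with the same variance (Var of U(-a, a) is a^2 / 3)."""
    return math.sqrt(6.0 / (n_in + n_out))

print(xavier_std(512, 256))            # about 0.051
print(xavier_uniform_bound(512, 256))  # about 0.088
```

Both variants target the same variance. They differ only in the shape of the distribution the weights are drawn from.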

2015 — He et al.

He / Kaiming Init

ReLU zeros out ~50% of values, so under Xavier-style scaling the signal's mean square is roughly halved at every layer. Fix: double the variance.

Var(W) = 2 / n_in

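When n_in = n_out, that factor of 2 is the entire difference from Xavier. It made very deep ReLU networks trainable from scratch, and it is the init used in ResNet-style models.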
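A quick numeric comparison of the two formulas, as a sketch. The helper names are illustrative. The conv fan-in rule (input channels times kernel area) is the usual convention, and the sizes are arbitrary.

```python
import math

def he_std(n_in):
    """Std for He normal init: Var(W) = 2 / n_in."""
    return math.sqrt(2.0 / n_in)

def xavier_std(n_in, n_out):
    """Std for Xavier normal init: Var(W) = 2 / (n_in + n_out)."""
    return math.sqrt(2.0 / (n_in + n_out))

def conv_fan_in(in_channels, kernel_h, kernel_w):
    """Each conv output sums over in_channels * kernel_h * kernel_w inputs."""
    return in_channels * kernel_h * kernel_w

n = 512
ratio = he_std(n) ** 2 / xavier_std(n, n) ** 2
print(ratio)  # 2.0 up to floating-point rounding

# A 3x3 conv with 64 input channels has fan-in 576
print(he_std(conv_fan_in(64, 3, 3)))  # about 0.0589
```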

📐 Xavier Derivation

1. Linear layer: y_j = Σ_i W_ji × x_i (sum over n_in inputs).

2. Assume the inputs are i.i.d. with zero mean and Var(x) = 1, and the weights are zero-mean and independent of x.

3. Then Var(y) = n_in × Var(W) × Var(x) = n_in × Var(W) (variance of a sum of independent products).

4. Want Var(y) = 1 (same as input) → set Var(W) = 1/n_in.

5. The backward pass gives the same constraint with n_out in place of n_in → Var(W) = 1/n_out.

6. Compromise: average the two fan sizes, n = (n_in + n_out)/2 → Var(W) = 1/n = 2/(n_in + n_out). Done!
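Step 3 of the derivation can be checked numerically. This is a Monte Carlo sketch under the same assumptions (zero-mean, unit-variance inputs and zero-mean weights, all independent). `output_variance` is a hypothetical helper, and the sample count is arbitrary.

```python
import random
import statistics

def output_variance(n_in, weight_var, samples=2000, seed=0):
    """Empirical variance of y = sum_i w_i * x_i over fresh random draws."""
    rng = random.Random(seed)
    w_std = weight_var ** 0.5
    ys = []
    for _ in range(samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_in)]
        w = [rng.gauss(0.0, w_std) for _ in range(n_in)]
        ys.append(sum(wi * xi for wi, xi in zip(w, x)))
    return statistics.pvariance(ys)

n_in = 100
var_scaled = output_variance(n_in, weight_var=1.0 / n_in)  # theory: 1
var_too_big = output_variance(n_in, weight_var=4.0 / n_in) # theory: 4
print(var_scaled, var_too_big)
```

The first value should land near 1 and the second near 4, matching Var(y) = n_in × Var(W) up to sampling noise.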

📐 He Derivation (just one extra line)

1. Same setup: Var(z) = n_in × Var(W) × E[x²] (pre-activation), where x is the previous layer's output.

2. After ReLU, half the values become zero. For a zero-mean, symmetric z this gives E[ReLU(z)²] = ½ × Var(z).

3. So each layer multiplies the pre-activation variance by ½ × n_in × Var(W). Want this = 1.

4. Solve → Var(W) = 2/n_in. The factor of 2 compensates for ReLU's 50% kill rate.

5. Uses only n_in (fan-in), which preserves the forward signal. He et al. note that the fan-out variant, which preserves backward gradients instead, works comparably.
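Step 2 is worth checking numerically, because the quantity that is halved is the mean square E[ReLU(z)²], which is what feeds the next layer's variance. The centered variance of ReLU(z) is smaller. This is a sketch assuming Gaussian pre-activations; `relu_moments` is an illustrative helper.

```python
import random

def relu_moments(z_std=1.0, samples=200_000, seed=0):
    """Return (mean square, centered variance) of ReLU(z) for z ~ N(0, z_std^2)."""
    rng = random.Random(seed)
    total = 0.0
    total_sq = 0.0
    for _ in range(samples):
        a = max(0.0, rng.gauss(0.0, z_std))
        total += a
        total_sq += a * a
    mean = total / samples
    second_moment = total_sq / samples
    return second_moment, second_moment - mean * mean

second_moment, variance = relu_moments()
print(second_moment)  # close to 0.5
print(variance)       # close to 0.5 - 1/(2*pi), roughly 0.34
```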

04 — Decision Guide

Which Init Should You Use?

In practice, this is the flowchart that matters:

[Flowchart: choosing an initialization scheme]

05 — Quick Reference

Modern Recipes at a Glance

Architecture	Initialization	Why
CNNs (ResNet)	He Normal	ReLU activations need the 2× factor
Transformers (GPT, BERT)	Xavier Normal, or N(0, 0.02) as in GPT-2 and BERT	LayerNorm stabilizes; GELU ≈ linear near 0
ViT	Trunc Normal σ=0.02	Stabilizes patch embeddings
RNN / LSTM	Orthogonal (recurrent weights) + Xavier (input weights)	Orthogonal matrices preserve norm, limiting explosion through time
GAN Generator	N(0, 0.02)	Stabilizes fragile adversarial training
GAN Discriminator	He Normal	Leaky ReLU typical

Transformer layer-by-layer:

Embeddings → N(0, 0.02) | Q/K/V, FFN → Xavier | LayerNorm → γ=1, β=0 | Output projections into the residual stream (deep models) → scale the init by 1/√num_layers
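The recipe above can be read as a lookup from parameter type to standard deviation. Here is a sketch of that reading. The `kind` labels and the function name are hypothetical, and it assumes the output projection starts from the Xavier std before being scaled down. LayerNorm is left out because γ=1, β=0 are constants, not random draws.

```python
import math

def transformer_init_std(kind, fan_in=None, fan_out=None, num_layers=None):
    """Weight std implied by the layer-by-layer recipe (labels are illustrative)."""
    if kind == "embedding":
        return 0.02
    xavier = math.sqrt(2.0 / (fan_in + fan_out))
    if kind in ("qkv", "ffn"):
        return xavier
    if kind == "output_proj":
        # Assumption: scale the Xavier std by 1/sqrt(num_layers) for deep stacks
        return xavier / math.sqrt(num_layers)
    raise ValueError(f"unknown parameter kind: {kind}")

d_model, d_ff, layers = 768, 3072, 12
emb = transformer_init_std("embedding")
qkv = transformer_init_std("qkv", d_model, d_model)
ffn = transformer_init_std("ffn", d_model, d_ff)
out = transformer_init_std("output_proj", d_model, d_model, num_layers=layers)
print(emb, qkv, ffn, out)
```

With these sizes the Xavier values come out in the same range as the fixed 0.02, which is one reason both conventions work in practice.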

06 — Edge Cases

Don't Forget These

⚖️

Biases

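Hidden layers → zero. Output bias for imbalanced classes → log(p / (1-p)), where p is the positive-class frequency and the output is a sigmoid. The network then starts out predicting the base rate, which speeds up convergence.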
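A small sketch of the output-bias trick, assuming a single sigmoid output and a positive-class frequency of 1%. Both helper names are illustrative.

```python
import math

def prior_bias(p):
    """Bias b such that sigmoid(b) == p, i.e. the log-odds of the base rate."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return math.log(p / (1.0 - p))

def sigmoid(b):
    return 1.0 / (1.0 + math.exp(-b))

b = prior_bias(0.01)
print(b)           # about -4.595
print(sigmoid(b))  # about 0.01
```

With zero weights feeding the output, the initial prediction equals the base rate, so the first training steps are not spent just learning that positives are rare.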

🔄

Norm Layers

γ=1, β=0 → starts as identity. Reduces sensitivity to weight init in earlier layers.

🧊

Transfer Learning

Keep pretrained weights. Only init the new head (Xavier/He). Fine-tune the pretrained layers with a tiny LR, roughly 1e-5 to 1e-6, so they are not overwritten.

🏗️

ReZero / Fixup

For 100+ layer nets: init residual branch scale α=0 so block starts as identity: y = x + 0·F(x).
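To make the identity-at-init claim concrete, here is a toy sketch in plain Python. The `branch` function is an arbitrary stand-in for F, and lists of floats stand in for tensors.

```python
def residual_block(x, branch, alpha):
    """y = x + alpha * F(x), applied elementwise to a list of floats."""
    return [xi + alpha * fi for xi, fi in zip(x, branch(x))]

def branch(x):
    # Arbitrary nonlinear stand-in for the residual branch F
    return [3.0 * xi * xi - 1.0 for xi in x]

x = [0.5, -1.0, 2.0]

# alpha = 0: a stack of 100 blocks is exactly the identity at initialization
out = x
for _ in range(100):
    out = residual_block(out, branch, alpha=0.0)
print(out == x)  # True

# alpha != 0: even a single block already changes the signal
changed = residual_block(x, branch, alpha=0.1)
print(changed == x)  # False
```

In a real model α would be a learnable parameter, so each block can grow away from the identity as training proceeds.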

07 — Code

PyTorch Cheat Sheet

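PyTorch defaults (nn.Linear, nn.Conv2d) already use Kaiming Uniform, but with a=√5, which works out to U(−1/√fan_in, 1/√fan_in) and is smaller than the ReLU-scaled He bound. Override explicitly when you want one of the recipes above: He for ReLU CNNs, or the transformer, GAN, and LSTM schemes.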

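import torch
import torch.nn as nn

# Example modules so each call below runs as written
conv = nn.Conv2d(3, 16, kernel_size=3)
linear = nn.Linear(128, 64)
embedding = nn.Embedding(1000, 128)
lstm = nn.LSTM(input_size=128, hidden_size=64)

# He Normal (CNNs with ReLU)
nn.init.kaiming_normal_(conv.weight, mode='fan_in', nonlinearity='relu')

# Xavier Normal (Transformers)
nn.init.xavier_normal_(linear.weight)

# Embeddings (GPT-style)
nn.init.normal_(embedding.weight, mean=0.0, std=0.02)

# Truncated Normal (ViT)
# Default cutoffs a=-2, b=2 are absolute values, so std=0.02 is effectively untruncated
nn.init.trunc_normal_(linear.weight, std=0.02)

# Orthogonal (RNN recurrent weights)
nn.init.orthogonal_(lstm.weight_hh_l0)

# LSTM forget gate bias → 1 (keep memory early on)
# PyTorch packs the gates as (input, forget, cell, output): forget is the second quarter
bias = lstm.bias_ih_l0
n = bias.size(0)
with torch.no_grad():  # in-place edits on a parameter must bypass autograd
    bias[n//4:n//2].fill_(1.0)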
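The remark about PyTorch defaults can be checked with a little arithmetic. The sketch below re-derives the Kaiming uniform bound in plain Python, assuming the leaky-ReLU gain formula √(2 / (1 + a²)) and the a=√5 that recent PyTorch releases pass in nn.Linear's default reset. `kaiming_uniform_bound` is an illustrative helper, not a PyTorch function.

```python
import math

def kaiming_uniform_bound(fan_in, a=0.0):
    """Half-width of U(-bound, bound) for Kaiming uniform with negative slope a."""
    gain = math.sqrt(2.0 / (1.0 + a * a))
    std = gain / math.sqrt(fan_in)
    return math.sqrt(3.0) * std  # U(-b, b) has std b / sqrt(3)

fan_in = 256
default_bound = kaiming_uniform_bound(fan_in, a=math.sqrt(5))  # 1/sqrt(fan_in) = 0.0625
relu_bound = kaiming_uniform_bound(fan_in, a=0.0)              # sqrt(6/fan_in), about 0.153
print(default_bound, relu_bound)
```

So the default draws weights from a narrower range than the ReLU-scaled He init, which is why ReLU networks often call kaiming_normal_ or kaiming_uniform_ explicitly.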
  

08 — Interview Prep

Weight Initialization